Data Partitioning for Parallel Entity Matching
نویسندگان
چکیده
Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be independently executed. One of our strategies supports both, blocking to reduce the search space for matching and parallel matching to improve efficiency. Special attention is given to the number and size of data partitions as they impact the overall communication overhead and memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies for matching real-world product data of different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
منابع مشابه
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents, produce imperfect records in real world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
متن کاملPartitioning Strategy Selection for In-Memory Graph Pattern Matching on Multiprocessor Systems
Pattern matching on large graphs is the foundation for a variety of application domains. The continuously increasing size of the underlying graphs requires highly parallel in-memory graph processing engines that need to consider non-uniform memory access (NUMA) and concurrency issues to scale up on modern multiprocessor systems. To tackle these aspects, a fine-grained graph partitioning becomes...
متن کاملDiverse information integration and visualization
This paper presents and explores a technique for visually integrating and exploring diverse information. Researchers and analysts seeking knowledge and understanding of complex systems have increasing access to related, but diverse, data. These data provide an opportunity to consider entities of interest from multiple informational perspectives not available from any single, data or information...
متن کاملOntology-Driven Data Partitioning and Recovery for Flexible Query Answering
Flexible Query Answering helps users find relevant information to their queries even if no exactly matching answers can be found in a database system. However, relaxing query conditions at runtime is inherently slow and does not scale as the data set grows. In this paper we propose a method to partition the data by using an ontology that semantically guides the query relaxation. Moreover, if se...
متن کاملA High Performance Parallel IP Lookup Technique Using Distributed Memory Organization and ISCB-Tree Data Structure
The IP Lookup Process is a key bottleneck in routing due to the increase in routing table size, increasing traıc and migration to IPv6 addresses. The IP address lookup involves computation of the Longest Prefix Matching (LPM), which existing solutions such as BSD Radix Tries, scale poorly when traıc in the router increases or when employed for IPv6 address lookups. In this paper, we describe a ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1006.5309 شماره
صفحات -
تاریخ انتشار 2010